Abstract

We show how to incorporate fluctuations of the recombination rate along the chromosome into standard
gene-genealogical models for the decorrelation of gene histories. This enables us to determine how
small-scale fluctuations (Poissonian hot-spot model) and large-scale variations (Kong et al. 2002) of
the recombination rate influence this decorrelation. We find that the empirically determined large-scale
variations of the recombination rate give rise to a significantly slower decay of correlations compared
to the standard, unstructured gene-genealogical model assuming constant recombination rate. A model
with long-range recombination-rate variations and with demographic structure (divergent population) is
found to be consistent with the empirically observed slow decorrelation of gene histories. Conversely, we
show that small-scale recombination-rate fluctuations do not alter the large-scale decorrelation of gene
histories.



Genome-wide variation and decorrelation of gene histories are reflected in patterns of linkage
disequilibrium, which in turn shape the genetic variation observed on the molecular level. Recently
Reich et al. (2002) reported on the first genome-wide measurement of correlations of human gene histories.
Reich et al. (2002) show that their data are inconsistent with standard gene-genealogical models allowing
for non-trivial population structures and demographic schemes, but assuming a constant recombination rate
over the genome. The question is thus: can fluctuations of the recombination rate along the chromosome
explain the slow correlation decay of gene histories?



Empirical results (Chakravarti et al. 1984; Goldstein 2001; Jeffreys et al. 2001) indicate
that the recombination rate is not constant along the chromosome. It was observed that, at certain
locations, an appreciable fraction of recombination events is concentrated in short regions [roughly 1kb
long and spaced 50kb apart (Stumpf and Goldstein 2003)], so-called hot spots. At least locally this implies
small-scale (< 100kb) variations of the recombination rate along the chromosome. Genome-wide, long-range
fluctuations of the recombination rate for humans have been empirically determined by Kong et al. (2002).
It is thus necessary to incorporate the effect of fluctuating recombination rates into the standard
gene-genealogical model (Griffiths 1981; Hudson 1983; Tavaré 1984; Kaplan and Hudson 1985; Hudson 1990;
Nordborg and Tavaré 2002). More generally, it is necessary to determine: on which length scales do
recombination-rate fluctuations at a certain scale influence the decorrelation function of gene histories
most significantly? It has been argued (Reich et al. 2002) that small-scale recombination-rate
fluctuations (< 100kb) related to hot spots are an important if not the main feature determining the slow
decorrelation of gene histories (assuming that hot spots are to be found genome-wide).

Here we derive an expression for the correlation of gene histories in neutral gene-genealogical models
allowing for fluctuating recombination rates. This enables us to explain and quantitatively describe
the influence of recombination-rate fluctuations on the correlation of gene histories. We find that
large-scale fluctuations of empirically determined recombination rates (Kong et al. 2002) give rise to a
significantly slower decay of correlations compared to the standard, unstructured, constant population-size
gene-genealogical model assuming constant recombination rate. Furthermore, a model with large-scale
recombination-rate fluctuations and with demographic structure [divergent population, see Eyre-Walker et al.
(1998); Teshima and Tajima (2002); Reich et al. (2002) and references cited therein] is found to be
consistent with the empirically observed decorrelation of gene histories. It is not necessary to invoke hot spots.



In a neutral model, Kaplan and Hudson (1985) [see also Griffiths (1981)] have derived a relation between
the correlation function $\rho_{T_x,T_y}$ of the times $T_x$ and $T_y$ to the most recent common ancestors
of two loci x and y, and the amount C of recombination between these two loci.¹ We observe that their
result depends on the total amount of recombination between the two loci, but not on the distribution of
recombination events between these loci. Moreover, this is still true when population structure is taken into
account. The expected correlation $\rho^{\rm exp}_{T_x,T_y}$ is obtained by averaging with a sliding window
of length |y - x| along the chromosome. Thus, if $p_X(C)$ is the genome-wide distribution of recombination
intensity C in bins of length X = |y - x|, the expected correlation is



$$\rho^{\rm exp}_{T_x,T_y} = \int {\rm d}C \, p_X(C) \, \rho_{T_x,T_y}(C) . \qquad (1)$$



It also follows that small-scale fluctuations of the recombination rate on length scales much smaller than X
are irrelevant to the decay of correlations on scales of the order of X. In particular, fluctuations due to hot
spots at small scales cannot change the decorrelation of gene histories at much larger scales.

¹The result of Kaplan and Hudson (1985) for the unstructured, constant population-size model is exact for
sample size n = 2; for large n it is a very good approximation.
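As a rough illustration of eq. (1), the sketch below evaluates the average for a discrete distribution of C. It assumes the n = 2 form of the Kaplan and Hudson correlation, (C + 18)/(C^2 + 13C + 18), which is the expression appearing in the caption of Fig. 1a. The function names and the two-class example values are hypothetical and chosen only for illustration.

```python
def rho_kaplan_hudson(c):
    """Correlation of the coalescence times at two loci (sample size n = 2)
    as a function of the scaled amount of recombination c between them."""
    return (c + 18.0) / (c * c + 13.0 * c + 18.0)


def expected_correlation(weighted_values, rho=rho_kaplan_hudson):
    """Discrete version of eq. (1): sum of p_X(C) * rho(C) over (C, weight)
    pairs whose weights add up to one."""
    return sum(weight * rho(c) for c, weight in weighted_values)


# Hypothetical example: the same mean C = 10, once constant and once split
# evenly between a cold (C = 2) and a hot (C = 18) class of windows.
rho_constant = expected_correlation([(10.0, 1.0)])
rho_fluctuating = expected_correlation([(2.0, 0.5), (18.0, 0.5)])
print(rho_constant, rho_fluctuating)
```

In this toy example the fluctuating case yields the larger expected correlation although the mean amount of recombination is the same, because the windows with little recombination contribute disproportionately to the average.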

Using eq. (1) we have computed $\rho^{\rm exp}_{T_x,T_y}$ in four models (Fig. 1): assuming small-scale
variation of the recombination rate (model I), incorporating, in addition, large-scale variation (model II),
estimating $p_X(C)$ from the empirical data of Kong et al. (2002) (model III), and, in addition, taking into
account demographic population structure (model IV). Model I is the Poissonian hot-spot model of Reich et al.
(2002), described in more detail in Fig. 1a below. From eq. (1) we obtain an explicit expression for
$\rho^{\rm exp}_{T_x,T_y}$ (caption of Fig. 1a). This result is shown in Fig. 2a. In agreement with
Reich et al. (2002) we find that on distances of the order of the hot-spot spacing, correlations are larger
than those in a constant recombination-rate model. However, there is no choice of parameters which could
explain the empirically observed decorrelation function (c.f. data in Fig. 6a of Reich et al. (2002),
reproduced in Fig. 2b below). In particular, no significant increase in correlations on length scales
$\gg \lambda^{-1}$ is observed, as discussed above.
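The Poisson-weighted sum for model I given in the caption of Fig. 1a can be evaluated directly. The sketch below is a minimal illustration, assuming all recombination occurs in hot spots with R_1 = R/λ and using the parameter values quoted in the caption of Fig. 2 (N = 10^4, r = 1.2cM/Mb, hot-spot spacing 50kb). The function names, the truncation of the series and the conversion of units to kb are choices made here, not taken from the original.

```python
import math


def rho_kaplan_hudson(c):
    """Two-locus correlation of coalescence times for sample size n = 2."""
    return (c + 18.0) / (c * c + 13.0 * c + 18.0)


def rho_model_one(distance_kb, hotspots_per_kb, r1):
    """Average rho(R_1 * k) over a Poisson number k of hot spots
    between two loci separated by distance_kb."""
    mean_hotspots = hotspots_per_kb * distance_kb
    # Truncate the series far out in the Poisson tail (assumed cut-off).
    n_terms = int(mean_hotspots + 12.0 * math.sqrt(mean_hotspots) + 30.0)
    weight = math.exp(-mean_hotspots)  # Poisson probability of k = 0
    total = 0.0
    for k in range(n_terms):
        total += weight * rho_kaplan_hudson(r1 * k)
        weight *= mean_hotspots / (k + 1)  # probability of k + 1 hot spots
    return total


N_EFF = 1.0e4                     # effective population size
R_PER_KB = 4.0 * N_EFF * 1.2e-5   # R = 4Nr with r = 1.2 cM/Mb = 1.2e-5 per kb
HOTSPOTS_PER_KB = 1.0 / 50.0      # lambda, one hot spot per 50 kb on average
R1 = R_PER_KB / HOTSPOTS_PER_KB   # scaled recombination per hot spot

for distance_kb in (10.0, 50.0, 200.0, 1000.0):
    hot_spot = rho_model_one(distance_kb, HOTSPOTS_PER_KB, R1)
    constant = rho_kaplan_hudson(R_PER_KB * distance_kb)
    print(distance_kb, hot_spot, constant)
```

Comparing the two printed columns at increasing distance gives a quick check of how the hot-spot model relates to the constant-rate model with the same mean recombination rate.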



Reich et al. (2002) have fitted an "arbitrary mixed model" to their empirical data. In order to
obtain these results it is necessary to introduce large-scale variations of the recombination rate, on a scale
$L \gg X \sim$ 1Mb. One possibility (model II) is to assume that hot spots occur in clusters, with long
($\gg$ 1Mb) regions of low recombination intensity between them, see Fig. 1b. This model provides a better
fit to the empirical data (Fig. 2b) than model I, indicating that large-scale fluctuations of the
recombination rate are important. Notice that assuming
$p_X(C) = (1 - p)\,\delta(C - R_0 X) + p\,\delta(C - R_1 X)$ can produce an equally good fit to the data
(e.g. for p = 0.55, $R_0$ = 1.2cM/Mb and $R_1$ = 0.02cM/Mb, not shown). In this model the recombination rate
is constant on large scales ($\gg$ 1Mb) and alternates between two values $R_0$ and $R_1$. However, this
model is not consistent with the empirically observed $p_X(C)$. We have estimated $p_X(C)$ from empirical
data (Kong et al. 2002) (Fig. 1c, model III). The corresponding results are shown in Fig. 2b. We find that
the empirically determined large-scale fluctuations of the recombination rate give rise to significantly
enhanced correlations (compared to the standard model assuming constant recombination rate), especially at
large distances.
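The estimation of $p_X(C)$ for model III amounts to sampling C = g(x + X) - g(x) at random physical positions x, as described in the caption of Fig. 1c. The sketch below illustrates this procedure under stated assumptions: the cumulative map is a small made-up table (not the Kong et al. data), g(x) is linearly interpolated between map entries, and genetic distances in cM are converted to scaled recombination via C = 4N(Δg/100). All names are hypothetical.

```python
import bisect
import random


def cumulative_distance(positions, genetic, x):
    """Linearly interpolated cumulative genetic distance g(x).
    positions must be strictly increasing."""
    i = bisect.bisect_right(positions, x)
    if i == 0:
        return genetic[0]
    if i == len(positions):
        return genetic[-1]
    x0, x1 = positions[i - 1], positions[i]
    g0, g1 = genetic[i - 1], genetic[i]
    return g0 + (g1 - g0) * (x - x0) / (x1 - x0)


def sample_recombination(positions, genetic_cm, window, n_samples,
                         n_eff=1.0e4, rng=None):
    """Draw samples of C = 4 N [g(x + X) - g(x)] at random positions x."""
    rng = rng or random.Random()
    start, end = positions[0], positions[-1] - window
    samples = []
    for _ in range(n_samples):
        x = rng.uniform(start, end)
        delta_cm = (cumulative_distance(positions, genetic_cm, x + window)
                    - cumulative_distance(positions, genetic_cm, x))
        samples.append(4.0 * n_eff * delta_cm / 100.0)  # cM to Morgan
    return samples


# Made-up toy map: physical position (Mb) and cumulative distance (cM).
POSITIONS_MB = [0.0, 1.0, 2.0, 3.0, 4.0]
GENETIC_CM = [0.0, 0.2, 2.2, 2.4, 4.8]

c_samples = sample_recombination(POSITIONS_MB, GENETIC_CM, window=1.0,
                                 n_samples=1000, rng=random.Random(1))
print(min(c_samples), max(c_samples))
```

A histogram of `c_samples` would play the role of $p_X(C)$ for the chosen window length, and averaging the two-locus correlation over these samples gives a Monte Carlo estimate of eq. (1).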

It is expected that population structure can increase the correlations of gene histories at large
distances. We have considered the effect of large-scale recombination-rate fluctuations within a
well-established model of demographic structure: the population was of constant size N until $t_0$
generations ago, when it split into two fractions of size $\gamma N$ and $(1 - \gamma) N$. The two
sub-populations remained separate until a recent merging (see for instance Eyre-Walker et al. (1998);
Teshima and Tajima (2002); Reich et al. (2002) and references therein). For sample size n = 2 we have
calculated $\rho_{T_x,T_y}(C)$ explicitly in this model (Eriksson and Mehlig 2004). Without
recombination-rate fluctuations, this model does not describe the empirically observed correlation of gene
histories (Reich et al. 2002). We have determined the effect of large-scale recombination-rate fluctuations
(Kong et al. 2002) on the correlation of gene histories in this model using eq. (1) and the explicit
expression for $\rho_{T_x,T_y}(C)$. The parameters of the model ($t_0$ and N) were chosen to be consistent
with the empirically estimated time to the most recent common ancestor and its coefficient of variation
(Reich et al. 2002). The parameter $\gamma$ was set to 0.3. The resulting correlation function matches the
empirical data reasonably well. Decreasing $\gamma$ gives rise to decreased correlations ($\gamma = 0$
corresponds to the standard model without demographic structure).

In summary, we have determined the influence of recombination-rate fluctuations on the decorrelation
of gene histories. We find that small-scale fluctuations are irrelevant to long-range correlation decay.
Empirically determined large-scale fluctuations of the recombination rate, however, are found to
significantly increase the correlations. Within a model with demographic structure, large-scale fluctuations
of empirically determined recombination rates significantly contribute to the empirically observed slow
decay of correlations.

We conclude by discussing the implications of our results for the study of genome-wide variability
as reflected in single-nucleotide polymorphism (SNP) statistics. Eq. (1) determines the effect of
recombination-rate fluctuations on $\rho^{\rm exp}_{T_x,T_y}$. This quantity, in turn, determines the
genome-wide statistics of SNP locations: the variance of the number of SNPs in bins of length l along the
chromosomes is determined by the integral of $\rho^{\rm exp}_{T_x,T_y}$ over x and y from 0 to l, i.e. by
how fast the correlations decay on scales of length l (Hudson 1990). The International SNP Map Working Group
(2001) has empirically determined the variance of the number of SNPs in short reads (of average length
500bp); the result was found to be consistent with the standard, unstructured gene-genealogical model
assuming a constant recombination rate. This is consistent with our results (Fig. 2b): on scales of the
order of 500bp, the recombination-rate fluctuations have little effect on the correlation function. We
expect, however, that in order to understand the statistics of SNP counts in longer bins, it will be
necessary to account for long-range recombination-rate fluctuations.
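The double integral of the correlation over x and y from 0 to l can be reduced to a single integral when the correlation depends only on |x - y|, since then it equals 2 times the integral of (l - s) ρ(s) over s from 0 to l. The sketch below evaluates this numerically for the constant-rate case only, assuming the n = 2 Kaplan and Hudson form and the scaled rate R = 4Nr with N = 10^4 and r = 1.2cM/Mb. It illustrates the integral mentioned in the text and is not a full expression for the SNP-count variance. Function names and the midpoint rule are choices made here.

```python
def rho_kaplan_hudson(c):
    """Two-locus correlation of coalescence times for sample size n = 2."""
    return (c + 18.0) / (c * c + 13.0 * c + 18.0)


def correlation_integral(length_bp, rate_per_bp, n_steps=2000):
    """Integral of rho(R |x - y|) over 0 <= x, y <= length_bp, computed as
    2 * integral of (l - s) rho(R s) ds with the midpoint rule."""
    step = length_bp / n_steps
    total = 0.0
    for i in range(n_steps):
        s = (i + 0.5) * step
        total += (length_bp - s) * rho_kaplan_hudson(rate_per_bp * s)
    return 2.0 * total * step


def normalised_integral(length_bp, rate_per_bp=4.0 * 1.0e4 * 1.2e-8):
    """Integral divided by l^2; equals 1 for fully correlated histories."""
    return correlation_integral(length_bp, rate_per_bp) / length_bp ** 2


short_bins = normalised_integral(500.0)      # reads of about 500 bp
long_bins = normalised_integral(100000.0)    # bins of 100 kb
print(short_bins, long_bins)
```

For short bins the normalised integral stays close to one, since the histories within a bin are almost fully correlated, while for long bins it is governed by how quickly the correlation decays with distance.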

Figure 1: Description of models. a Model I: Recombination events occur at hot spots (of zero width)
with rate $R_1 = R/\lambda$. The remaining fraction occurs uniformly with rate $R_0 = (1 - p)R$. The
number of hot spots in a locus of length l is Poisson distributed with rate $\lambda l$. Eq. (1) gives
$\rho^{\rm exp}_{T_x,T_y} = e^{-\lambda X} \sum_{k=0}^{\infty} \frac{(\lambda X)^k}{k!} \, \frac{R_1 k + 18}{(R_1 k)^2 + 13 R_1 k + 18}$.
This result is exact for n = 2. b Model II: hot spots occur in clusters of size pL, separated by empty
regions of length (1 - p)L, 0 < p < 1, and L is a typical length scale (of the order of several Mb). Within
a cluster, the number of hot spots is Poisson distributed. c Model III: the genome-wide distribution
$p_X(C)$ is obtained by sampling C = g(x + X) - g(x) from empirical data (Kong et al. 2002) on the
cumulative genetic distance g(x) by randomly choosing physical positions x. The curve g(x) is obtained from
the Nature Genetics web supplement NG917-S13 (Kong et al. 2002), from columns 1 (physical distance) and 3
(sex-averaged genetic distance) assuming an effective population size of N = 10^4, by ignoring entries
labeled "NA", and by shifting the origin of both physical and genetic distances so that g(0) = 0. Shown here
is $p_X(C)$ for chromosome 5; X = 200kb (red) and 1Mb (blue). d Demographic structure (divergent
population): it is assumed that the population size was N until time $t_0$ in the past when it split into
two populations of sizes $\gamma N$ and $(1 - \gamma)N$. The parameters $t_0$ and N are chosen to be
consistent with empirical data on the mean of the time to the most recent common ancestor and its
coefficient of variation; table 1 in (Reich et al. 2002). The asymmetry parameter $\gamma$ varies between 0
and 1/2. For sample size n = 2, the function $\rho_{T_x,T_y}(C)$ for this model was calculated by Eriksson
and Mehlig (2004).

Figure 2: Decorrelation of gene histories $\rho^{\rm exp}_{T_x,T_y}$. a Model I (red lines), R = 4Nr,
N = 10^4, and r = 1.2cM/Mb. The dashed line corresponds to constant recombination rate. b Model II (blue
line), p = 0.55, $\lambda^{-1}$ = 50kb, r = 1.2cM/Mb, and $L \gg X$; model III (red line), no fitting
parameters, sex-averaged $p_X(C)$; model IV (green line) for $\gamma$ = 0.3 (asymmetric split); empirical
data (taken from Fig. 6a in (Reich et al. 2002), upper and lower confidence limits, points); constant
recombination rate (dashed line).